ggml : reduce hash table reset cost #8698

Merged: 4 commits into master on Jul 27, 2024
Conversation

slaren (Collaborator) commented on Jul 26, 2024

Reduces the reset cost of hash tables by using a bit table to indicate whether a slot is in use, instead of a NULL pointer. This reduces the amount of memory that needs to be cleared when resetting a hash table by a factor of 64 (32 on 32-bit systems). The difference in performance is only measurable with small models, but with very small models such as stories260K it can be significant. Since the CUDA backend supports async computation, the hash table resets could be done in parallel with the computation, hiding most of this latency; however, in the case of the stories260K model the reset was more expensive than the computation itself.
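
For illustration, a minimal, self-contained sketch of the idea (names and layout are my own, not the actual ggml implementation): slot occupancy lives in a separate bitset, so a reset only clears one bit per slot instead of one pointer per slot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Illustrative hash set: "slot in use" is tracked in a bitset instead of
// being encoded as a non-NULL pointer in the keys array.
struct hash_set {
    size_t         size;
    uint32_t     * used;   // 1 bit per slot, (size + 31) / 32 words
    const void  ** keys;   // slot contents, only meaningful if the bit is set
};

static inline bool hash_set_slot_used(const struct hash_set * s, size_t i) {
    return (s->used[i / 32] >> (i % 32)) & 1;
}

static inline void hash_set_mark_used(struct hash_set * s, size_t i) {
    s->used[i / 32] |= 1u << (i % 32);
}

// Reset now clears 1 bit per slot instead of an 8-byte pointer per slot:
// far less memory traffic, and the keys array is not touched at all.
static inline void hash_set_reset(struct hash_set * s) {
    memset(s->used, 0, ((s->size + 31) / 32) * sizeof(uint32_t));
}
```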

Other changes included in this PR:

  • Refactors GGML_ASSERT to move most of the code into an external function, ggml_abort, to reduce code size (a minimal sketch of this split is shown after the benchmark table below)
  • Adds GGML_ABORT to abort the process with a message
  • Replaces GGML_ASSERT(false) with GGML_ABORT
  • Adds GGML_NORETURN to tag functions that do not return, used on ggml_abort
  • As a consequence of the previous points, code after GGML_ABORT now generates an "unreachable code" warning in clang (gcc does not have this warning). I have fixed these warnings in a separate commit to simplify the review.
  • Adds -g to all the make release builds to include debug symbols. This should not affect performance, but it will improve the accuracy of call stacks.
  • Moves some of the ggml hash functions to the header file so that the compiler can inline them
  • Minor improvements to the reliability and performance of the hash tables and ggml_backend_sched
| Device | Model | Test | t/s master | t/s sl/hash-improvements | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| RTX 3090 Ti (CUDA) | stories260K | pp512 | 329655.68 | 344158.50 | 1.04 |
| RTX 3090 Ti (CUDA) | stories260K | tg128 | 1817.33 | 2171.60 | 1.19 |
| RTX 3090 Ti (CUDA) | llama 1B Q8_0 | tg128 | 374.48 | 388.12 | 1.04 |
| RTX 3090 Ti (CUDA) | llama 7B Q4_0 | tg128 | 162.92 | 164.17 | 1.01 |
| i9-13900K (CPU) | stories260K | pp512 | 20665.15 | 21438.21 | 1.04 |
| i9-13900K (CPU) | stories260K | tg128 | 3104.72 | 4727.22 | 1.52 |
| i9-13900K (CPU) | llama 1B Q8_0 | tg128 | 65.99 | 67.75 | 1.03 |
| M3 Max (Metal) | stories260K | pp512 | 62813.44 | 63433.15 | 1.01 |
| M3 Max (Metal) | stories260K | tg128 | 1389.56 | 1449.33 | 1.04 |
| M3 Max (CPU) | stories260K | pp512 | 48704.42 | 49237.79 | 1.01 |
| M3 Max (CPU) | stories260K | tg128 | 2296.40 | 2516.05 | 1.10 |
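
For context, a rough sketch of the GGML_NORETURN / ggml_abort / GGML_ABORT split described in the list above (simplified; the final version also adds format-string support to GGML_ABORT, per the commit list at the bottom):

```c
#include <stdio.h>
#include <stdlib.h>

// Portable "does not return" attribute; it tells the compiler that control
// never continues past ggml_abort, which is what makes code after
// GGML_ABORT detectably unreachable.
#if defined(_MSC_VER)
#    define GGML_NORETURN __declspec(noreturn)
#else
#    define GGML_NORETURN _Noreturn
#endif

// Keeping the message printing and abort() in one out-of-line function
// shrinks the code emitted at every GGML_ASSERT call site.
static GGML_NORETURN void ggml_abort(const char * file, int line, const char * msg) {
    fprintf(stderr, "%s:%d: %s\n", file, line, msg);
    abort();
}

#define GGML_ABORT(msg) ggml_abort(__FILE__, __LINE__, msg)
#define GGML_ASSERT(x)  do { if (!(x)) GGML_ABORT(#x); } while (0)

// Usage: an unimplemented switch case no longer needs a trailing
// break/return, because the compiler knows ggml_abort never returns.
static int op_cost(int op) {
    switch (op) {
        case 0:  return 1;
        default: GGML_ABORT("not implemented");
    }
}

int main(void) {
    printf("%d\n", op_cost(0)); // prints 1; op_cost(1) would abort with file:line
    return 0;
}
```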

github-actions bot added the ggml label on Jul 26, 2024
slaren (Collaborator, Author) commented on Jul 26, 2024

It seems that older versions of gcc are not capable of understanding that code after GGML_ASSERT(false) is unreachable, so they generate a different warning instead. I can see three options:

  • Disable -Wunreachable-code in clang
  • Disable -Wimplicit-fallthrough in gcc
  • Roll back the change to GGML_ASSERT and keep adding useless breaks and returns after GGML_ASSERT(false)

slaren force-pushed the sl/hash-improvements branch 2 times, most recently from e0dbcdd to 5fd4cef on July 26, 2024 02:48
ggerganov (Owner) left a comment
To solve the implicit-fallthrough warnings, maybe we can add a GGML_ABORT macro like this:

```diff
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index d0311f6c..527dadfb 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -272,7 +272,8 @@
 #define GGML_NORETURN _Noreturn
 #endif
 
-#define GGML_ASSERT(x) if (!(x)) ggml_abort(__FILE__, __LINE__, #x)
+#define GGML_ABORT(x) ggml_abort(__FILE__, __LINE__, #x)
+#define GGML_ASSERT(x) if (!(x)) GGML_ABORT(x)
 
 // used to copy the number of elements and stride in bytes of tensors into local variables.
 // main purpose is to reduce code duplication and improve readability.
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 3fd9c4fe..a6cbb0ae 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -17924,7 +17924,7 @@ static void ggml_compute_backward(struct ggml_context * ctx, struct ggml_tensor
                         } break;
                     case GGML_UNARY_OP_TANH:
                         {
-                            GGML_ASSERT(false); // TODO: not implemented
+                            GGML_ABORT("not implemented");
                         }
                     case GGML_UNARY_OP_ELU:
                         {
```

github-actions bot added the testing, Nvidia GPU, Vulkan, examples, SYCL, and Kompute labels on Jul 26, 2024
slaren force-pushed the sl/hash-improvements branch 2 times, most recently from dccd09d to cdc02b7 on July 26, 2024 14:39
slaren (Collaborator, Author) commented on Jul 26, 2024

While working on this, I also found that building the graphs in the build_llama etc. functions takes almost as much time as splitting and allocating the graph in ggml_backend_sched_alloc_graph, which is completely unreasonable. I tested a few changes to see how much this could be improved:

```
base:                        0.094 ms
no cb:                       0.074 ms (-0.020 ms)
no GGML_ASSERT:              0.063 ms (-0.011 ms)
no fmt name in expand:       0.048 ms (-0.015 ms)
remove format_name entirely: 0.032 ms (-0.016 ms)
remove expand (*):           0.028 ms
```

(*) add tensors to the list of nodes as they are created, rather than doing a DFS in ggml_build_forward_expand to build the list afterwards.

So the conclusion is that this could be improved significantly, reducing the need for complicated logic to reuse the graphs as in #8366. Simply removing the calls to ggml_format_name should already cut the cost of building the graphs substantially.
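
To make the numbers above a bit more concrete, here is a hypothetical, self-contained stand-in (not the ggml API) for the kind of work ggml_format_name does: one printf-style formatting call per node during graph construction, which is pure bookkeeping and is what the "no cb" / "no fmt name" rows remove.

```c
#include <stdarg.h>
#include <stdio.h>

// Hypothetical stand-in for a graph node with a fixed-size debug name.
struct node {
    char name[64];
};

// Stand-in for the formatting work: a vsnprintf per node.
static void node_format_name(struct node * n, const char * fmt, ...) {
    va_list args;
    va_start(args, fmt);
    vsnprintf(n->name, sizeof(n->name), fmt, args);
    va_end(args);
}

int main(void) {
    struct node nodes[32] = {0};
    // Building a toy "graph": every node gets a formatted debug name. For
    // real graphs with many more nodes this adds up, and skipping it
    // (e.g. naming tensors only in debug builds) removes most of the
    // per-node bookkeeping cost measured above.
    for (int il = 0; il < 32; il++) {
        node_format_name(&nodes[il], "ffn_out-%d", il);
    }
    printf("%s\n", nodes[31].name); // ffn_out-31
    return 0;
}
```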

slaren force-pushed the sl/hash-improvements branch from 8ccf979 to ae5331d on July 27, 2024 01:48
slaren merged commit 2b1f616 into master on Jul 27, 2024 (58 of 59 checks passed).
slaren deleted the sl/hash-improvements branch on July 27, 2024 02:41.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Jul 27, 2024:
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string